 big data pipeline


Artificial Intelligence for Cost-Aware Resource Prediction in Big Data Pipelines

arXiv.org Artificial Intelligence

Efficient resource allocation is a key challenge in modern cloud computing. Over-provisioning leads to unnecessary costs, while under-provisioning risks performance degradation and SLA violations. This work presents an artificial intelligence approach to predicting resource utilization in big data pipelines using Random Forest regression. We preprocess the Google Borg cluster traces to clean, transform, and extract relevant features (CPU, memory, usage distributions). The model achieves high predictive accuracy (R² = 0.99, MAE = 0.0048, RMSE = 0.137), capturing non-linear relationships between workload characteristics and resource utilization. Error analysis shows strong performance on small-to-medium jobs, with higher variance on rare large-scale jobs. These results demonstrate the potential of AI-driven prediction for cost-aware autoscaling in cloud environments, reducing unnecessary provisioning while safeguarding service quality.
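
The three scores quoted in the abstract above (R², MAE, RMSE) have standard definitions, and a minimal sketch of how they are computed may help when reading them. The function name `regression_metrics` and the sample values are illustrative assumptions, not taken from the paper, and the paper's own evaluation code is not shown in the source.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences of numbers.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error: average size of a miss, in the target's units.
    mae = sum(abs(e) for e in errors) / n

    # Root mean squared error: squares each miss first, so large misses dominate.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: share of the target's variance explained by the predictions.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return r2, mae, rmse


# Made-up utilization values: three exact predictions and one large miss.
r2, mae, rmse = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

Because RMSE weights large errors more heavily than MAE, a wide gap between the two is what a small number of large misses would produce, which is consistent with the higher variance the abstract reports for rare large-scale jobs.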


High-throughput Cotton Phenotyping Big Data Pipeline Lambda Architecture Computer Vision Deep Neural Networks

arXiv.org Artificial Intelligence

In this study, we propose a big data pipeline for cotton bloom detection using a Lambda architecture, which enables both real-time and batch processing of data. Our proposed approach leverages Azure resources such as Data Factory, Event Grid, REST APIs, and Databricks. This work is the first to develop and demonstrate the implementation of such a pipeline for plant phenotyping through Azure's cloud computing service. The proposed pipeline consists of data preprocessing, object detection using a YOLOv5 neural network model trained through Azure AutoML, and visualization of object detection bounding boxes on output images. The trained model achieves a mean Average Precision (mAP) score of 0.96, demonstrating its high performance for cotton bloom classification. We evaluate our Lambda architecture pipeline using 9,000 images, yielding an optimized runtime of 34 minutes. The results illustrate the scalability of the proposed pipeline as a solution for deep learning object detection, with the potential for further expansion through additional Azure processing cores. This work provides a new method for cotton bloom detection on a large dataset and demonstrates the potential of cloud computing resources, specifically Azure, for efficient and accurate big data processing in precision agriculture.
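
The Lambda architecture mentioned above combines a batch layer, which periodically recomputes results from the full dataset, with a speed layer that covers data arriving since the last batch run. The following is a minimal in-memory sketch of that idea only. The class `LambdaPipeline`, its methods, and the per-plot bloom counts are hypothetical and do not represent the Azure services or the detection model used in the paper.

```python
from collections import Counter


class LambdaPipeline:
    """Toy Lambda architecture: batch view + speed view, merged at query time."""

    def __init__(self):
        self.master = []             # append-only record of every event
        self.batch_view = Counter()  # rebuilt from scratch by the batch layer
        self.speed_view = Counter()  # incremental totals since the last batch run

    def ingest(self, plot_id, bloom_count):
        # New data goes to the master dataset and to the low-latency speed layer.
        self.master.append((plot_id, bloom_count))
        self.speed_view[plot_id] += bloom_count

    def run_batch(self):
        # Batch layer: recompute the full view from the master dataset,
        # then clear the speed view it has now absorbed.
        view = Counter()
        for plot_id, bloom_count in self.master:
            view[plot_id] += bloom_count
        self.batch_view = view
        self.speed_view = Counter()

    def query(self, plot_id):
        # Serving layer: merge the batch result with recent real-time updates.
        return self.batch_view[plot_id] + self.speed_view[plot_id]


pipeline = LambdaPipeline()
pipeline.ingest("plot-a", 3)
pipeline.ingest("plot-a", 2)
pipeline.run_batch()
pipeline.ingest("plot-a", 1)   # arrives after the batch run
total = pipeline.query("plot-a")
```

The design point is that queries stay correct between batch runs: the batch view may be stale, but the speed view fills the gap until the next recomputation.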


How AIOps Conquers Performance Gaps on Big Data Pipelines - The New Stack

#artificialintelligence

If your data pipelines are growing in complexity beyond the point where you can manage them, you're not alone. Pipelines have become so massive, and are crisscrossed by so many dependencies, that it can be hard to see how all the components fit together and hard to identify the issues and opportunities that affect app performance and availability. Data stacks combine many disparate elements for data gathering and analysis, among other functions, and exponential data growth in most organizations only adds to the challenge. In such an environment, simply monitoring performance and reacting when it lags is no longer a viable approach. With AIOps (Artificial Intelligence for IT Operations), a correlated data model helps you discover the full context of your apps and system resources so that you can adequately plan, manage, and improve performance.


Adding Stanford CoreNLP To Big Data Pipelines (Apache NiFi 1.1/HDF 2.1) Part 1 of 2 - Hortonworks

@machinelearnbot

The latest version of Stanford CoreNLP includes a server that you can run and access via a REST API. CoreNLP adds a lot of features, but the one most interesting to me is sentiment analysis. The download is big: it includes the models, all the JARs, and the server code. Giving the JVM 4 GB of RAM makes it run nicely. Port 9000 works for me.
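
As a rough sketch of what calling that server looks like, the snippet below builds a sentiment request with Python's standard library. It assumes a CoreNLP server is already running locally on port 9000 with about 4 GB of heap, as described above. The helper names `build_sentiment_request` and `sentence_sentiments` are made up for this example, and the annotator list and response fields follow the CoreNLP server's documented JSON output as I understand it, so check them against your installed version.

```python
import json
import urllib.parse
import urllib.request

# Assumed server start command, run from the CoreNLP directory:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000


def build_sentiment_request(text, host="localhost", port=9000):
    """Build a POST request asking the CoreNLP server for sentiment as JSON."""
    props = {
        "annotators": "tokenize,ssplit,pos,parse,sentiment",
        "outputFormat": "json",
    }
    # The server reads its settings from a URL-encoded "properties" parameter;
    # the raw text to annotate goes in the request body.
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    url = f"http://{host}:{port}/?{query}"
    return urllib.request.Request(url, data=text.encode("utf-8"), method="POST")


def sentence_sentiments(text, **kwargs):
    """Send the text to the server and return one sentiment label per sentence."""
    request = build_sentiment_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        result = json.loads(response.read().decode("utf-8"))
    return [s.get("sentiment") for s in result.get("sentences", [])]
```

With the server up, `sentence_sentiments("The pipeline works well. The setup was painful.")` should return one label per sentence. Building the request separately from sending it makes the URL and payload easy to inspect without a running server.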